Note: Dataset donated by Ron Kohavi and Barry Becker, from the article "Scaling Up the Accuracy of Naive-Bayes Classifiers: A Decision-Tree Hybrid". Small changes to the dataset have been made, such as removing the 'fnlwgt' feature and records with missing or ill-formatted entries.
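As a sketch of how the data might be loaded for the inspection below, assuming the cleaned dataset lives in a CSV file (the filename and the two-row sample here are illustrative, not from the source; only the column names `age`, `workclass`, and `income` come from the dataset itself):

```python
import io

import pandas as pd

# Illustrative stand-in for the real CSV file. In practice this would be
# something like pd.read_csv("census.csv").
csv_text = """age,workclass,income
39,State-gov,<=50K
50,Self-emp-not-inc,>50K
"""

data = pd.read_csv(io.StringIO(csv_text))

# Same inspection call whose output appears below.
data.info()
```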
```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45222 entries, 0 to 45221
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   age              45222 non-null  int64
 1   workclass        45222 non-null  object
 2   education_level  45222 non-null  object
 3   education-num    45222 non-null  float64
 4   marital-status   45222 non-null  object
 5   occupation       45222 non-null  object
 6   relationship     45222 non-null  object
 7   race             45222 non-null  object
 8   sex              45222 non-null  object
 9   capital-gain     45222 non-null  float64
 10  capital-loss     45222 non-null  float64
 11  hours-per-week   45222 non-null  float64
 12  native-country   45222 non-null  object
 13  income           45222 non-null  object
dtypes: float64(4), int64(1), object(9)
memory usage: 4.8+ MB
```
| Feature | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| age | 45222 | NaN | NaN | NaN | 38.5479 | 13.2179 | 17 | 28 | 37 | 47 | 90 |
| workclass | 45222 | 7 | Private | 33307 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| education_level | 45222 | 16 | HS-grad | 14783 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| education-num | 45222 | NaN | NaN | NaN | 10.1185 | 2.55288 | 1 | 9 | 10 | 13 | 16 |
| marital-status | 45222 | 7 | Married-civ-spouse | 21055 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| occupation | 45222 | 14 | Craft-repair | 6020 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| relationship | 45222 | 6 | Husband | 18666 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| race | 45222 | 5 | White | 38903 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| sex | 45222 | 2 | Male | 30527 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| capital-gain | 45222 | NaN | NaN | NaN | 1101.43 | 7506.43 | 0 | 0 | 0 | 0 | 99999 |
| capital-loss | 45222 | NaN | NaN | NaN | 88.5954 | 404.956 | 0 | 0 | 0 | 0 | 4356 |
| hours-per-week | 45222 | NaN | NaN | NaN | 40.938 | 12.0075 | 1 | 40 | 40 | 45 | 99 |
| native-country | 45222 | 41 | United-States | 41292 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| income | 45222 | 2 | <=50K | 34014 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
- Number of observations: 45222
- Number of people with income > 50k: 11208
- Number of people with income <= 50k: 34014
- Percent of people with income > 50k: 24.78%
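The figures above can be computed directly from the frame; a minimal sketch on a toy frame (the real 45222-record frame would take its place), assuming the label column is named `income` as in the summary above:

```python
import pandas as pd

# Toy stand-in: 3 of 4 records at or below 50K, mirroring the real computation.
data = pd.DataFrame({"income": ["<=50K", "<=50K", "<=50K", ">50K"]})

n_records = len(data)
n_greater_50k = (data["income"] == ">50K").sum()
n_at_most_50k = (data["income"] == "<=50K").sum()
greater_percent = 100.0 * n_greater_50k / n_records

print(n_records, n_greater_50k, n_at_most_50k, round(greater_percent, 2))
```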
Before this data can be used for modeling and application to machine learning algorithms, it must be cleaned, formatted, and structured.
Factor names containing special characters, such as `-`, can cause issues, so cleaning them may prove helpful.
Working with categorical variables often involves transforming strings into numeric values: 0 or 1 for a binomial factor, and $\{x_{0}, x_{1}, \ldots, x_{n}\} \mapsto \{0, 1, \ldots, n\}$ for a multinomial one.
These values may be ordinal (i.e. values whose relationships can be compared as a ranking, e.g. worst, better, best) or nominal (i.e. values that merely indicate a state, e.g. blue, green, yellow).
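One way to produce such encodings, sketched with `pandas.factorize`, which assigns integers in order of first appearance and so can reproduce mappings like those below (the toy frame is illustrative):

```python
import pandas as pd

data = pd.DataFrame({"income": ["<=50K", ">50K", "<=50K", ">50K"]})

# factorize returns (codes, uniques); codes follow first-appearance order,
# so '<=50K' -> 0 and '>50K' -> 1 here.
codes, uniques = pd.factorize(data["income"])
data["numeric_income"] = codes

print(dict(zip(uniques, range(len(uniques)))))
```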
Mapping for variable: `numeric_income`

| | Factor Value | Numerical Value |
|---|---|---|
| 0 | <=50K | 0 |
| 1 | >50K | 1 |
Mapping for variable: `numeric_workclass`

| | Factor Value | Numerical Value |
|---|---|---|
| 0 | State-gov | 0 |
| 1 | Self-emp-not-inc | 1 |
| 2 | Private | 2 |
| 3 | Federal-gov | 3 |
| 4 | Local-gov | 4 |
| 5 | Self-emp-inc | 5 |
| 6 | Without-pay | 6 |
Mapping for variable: `numeric_marital_status`

| | Factor Value | Numerical Value |
|---|---|---|
| 0 | Never-married | 0 |
| 1 | Married-civ-spouse | 1 |
| 2 | Divorced | 2 |
| 3 | Married-spouse-absent | 3 |
| 4 | Separated | 4 |
| 5 | Married-AF-spouse | 5 |
| 6 | Widowed | 6 |
Mapping for variable: `numeric_occupation`

| | Factor Value | Numerical Value |
|---|---|---|
| 0 | Adm-clerical | 0 |
| 1 | Exec-managerial | 1 |
| 2 | Handlers-cleaners | 2 |
| 3 | Prof-specialty | 3 |
| 4 | Other-service | 4 |
| 5 | Sales | 5 |
| 6 | Transport-moving | 6 |
| 7 | Farming-fishing | 7 |
| 8 | Machine-op-inspct | 8 |
| 9 | Tech-support | 9 |
| 10 | Craft-repair | 10 |
| 11 | Protective-serv | 11 |
| 12 | Armed-Forces | 12 |
| 13 | Priv-house-serv | 13 |
Mapping for variable: `numeric_relationship`

| | Factor Value | Numerical Value |
|---|---|---|
| 0 | Not-in-family | 0 |
| 1 | Husband | 1 |
| 2 | Wife | 2 |
| 3 | Own-child | 3 |
| 4 | Unmarried | 4 |
| 5 | Other-relative | 5 |
Mapping for variable: `numeric_race`

| | Factor Value | Numerical Value |
|---|---|---|
| 0 | White | 0 |
| 1 | Black | 1 |
| 2 | Asian-Pac-Islander | 2 |
| 3 | Amer-Indian-Eskimo | 3 |
| 4 | Other | 4 |
Mapping for variable: `numeric_sex`

| | Factor Value | Numerical Value |
|---|---|---|
| 0 | Male | 0 |
| 1 | Female | 1 |
Mapping for variable: `numeric_education_level`

| | Factor Value | Numerical Value |
|---|---|---|
| 0 | Doctorate | 0 |
| 1 | Prof-school | 1 |
| 2 | Masters | 2 |
| 3 | Bachelors | 3 |
| 4 | Assoc-voc | 4 |
| 5 | Assoc-acdm | 5 |
| 6 | Some-college | 6 |
| 7 | HS-grad | 7 |
| 8 | 12th | 8 |
| 9 | 11th | 9 |
| 10 | 10th | 10 |
| 11 | 9th | 11 |
| 12 | 7th-8th | 12 |
| 13 | 5th-6th | 13 |
| 14 | 1st-4th | 14 |
| 15 | Preschool | 15 |
For training an algorithm, it is useful to separate the label, or dependent variable ($Y$), from the rest of the data: the features (`training_features`), or independent variables ($X$).
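A minimal sketch of that separation, assuming the label column is `income` (the toy frame stands in for the real one; the name `training_features` follows the text):

```python
import pandas as pd

data = pd.DataFrame({
    "age": [39, 50],
    "hours-per-week": [40.0, 13.0],
    "income": [0, 1],
})

# Y: the label / dependent variable.
income = data["income"]

# X: everything else, the independent variables.
training_features = data.drop("income", axis=1)
```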
The features capital_gain and capital_loss are positively skewed (i.e. have a long tail in the positive direction).
To reduce this skew, a logarithmic transformation, $\tilde x = \ln\left(x + 1\right)$ (the offset of 1 keeps the transform defined for the many zero-valued records), can be applied. This transformation reduces the variance and pulls the mean closer to the center of the distribution.
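Since capital-gain and capital-loss are zero for most records, the transform is commonly applied as $\ln(x+1)$; a sketch with `numpy.log1p` (the sample values are illustrative):

```python
import numpy as np
import pandas as pd

# A few capital-gain-like values: mostly zeros plus extreme outliers.
skewed = pd.Series([0.0, 0.0, 0.0, 5000.0, 99999.0])

# ln(x + 1), so x = 0 maps to 0 instead of -inf.
log_transformed = np.log1p(skewed)

# The transform compresses the long right tail.
print(skewed.var(), log_transformed.var())
```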
Why does this matter: The extreme points may affect the performance of the predictive model.
Why care: We want an easily discernible relationship between the independent and dependent variables; the skew makes that more complicated.
Why DOESN'T this matter: Most models make no assumption about the distribution of the independent variables; linear regression, for example, assumes only zero-mean residuals, $E\left(u \mid x\right) = 0$ where $u = Y - \hat{Y}$, and homoskedasticity of the dependent variable given the independent variables. In this analysis, the dependent variable is categorical (i.e. discrete or non-continuous), so linear regression is not an appropriate model in any case.
| Feature | Skewness | Mean | Variance |
|---|---|---|---|
| Capital Loss | 4.516154 | 88.595418 | 1.639858e+05 |
| Capital Gain | 11.788611 | 1101.430344 | 5.634525e+07 |
| Feature | Skewness | Mean | Variance |
|---|---|---|---|
| Capital Loss | 4.516154 | 88.595418 | 163985.81018 |
| Capital Gain | 11.788611 | 1101.430344 | 56345246.60482 |
| Log Capital Loss | 4.271053 | 0.355489 | 2.54688 |
| Log Capital Gain | 3.082284 | 0.740759 | 6.08362 |